tooling(scripts): add per-template sweep classifiers (#187/#190/#192/#193) #194

Open
hyperpolymath wants to merge 2 commits into main from feat/sweep-classifiers

@hyperpolymath
Owner

Summary

Durable tooling for the wrapper-sweep work that follows each of the four foundational reusable PRs filed today (#187 mirror, #190 secret-scanner, #192 codeql, #193 hypatia-scan).

Adds scripts/sweep-classifiers/:

What each classifier does

  1. Reads a paginated gh api /search/code JSON dump for the template
  2. Fetches each unique blob SHA exactly once (cached in $BLOBS_DIR)
  3. Classifies each blob (job-set match, line-count band, language matrix)
  4. Emits per-repo TSV: <repo>\t<sha>\t<class>\t<reason>\t<lines>\t<details>
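
The "fetch each unique blob SHA exactly once" step can be sketched as below. This is a minimal illustration, not the committed classifier code: the function names `fetch_blob` and `cached_blob`, the default cache location, and the one-file-per-SHA cache layout are all assumptions.

```shell
# Sketch only: names and cache layout are hypothetical.
BLOBS_DIR="${BLOBS_DIR:-/tmp/sweep-blobs}"

# Hypothetical fetcher: print the decoded content of one blob.
# The Git Blobs REST endpoint returns the content base64-encoded.
fetch_blob() {
  local repo="$1" sha="$2"
  gh api "repos/${repo}/git/blobs/${sha}" --jq '.content' | base64 --decode
}

# Fetch a blob at most once; later calls for the same SHA reuse the cache.
# Prints the path of the cached file.
cached_blob() {
  local repo="$1" sha="$2"
  local cache_file="${BLOBS_DIR}/${sha}"
  mkdir -p "$BLOBS_DIR"
  if [ ! -f "$cache_file" ]; then
    # Write to a temp file first so a failed fetch never poisons the cache.
    fetch_blob "$repo" "$sha" > "${cache_file}.tmp" || {
      rm -f "${cache_file}.tmp"
      return 1
    }
    mv "${cache_file}.tmp" "$cache_file"
  fi
  printf '%s\n' "$cache_file"
}
```

Keying the cache on the blob SHA rather than on the repo means identical workflow files in different repos cost a single API call.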

Numbers produced across the four campaign templates

| Template | TRIVIAL / mechanical | NEEDS_REVIEW | Notable |
|---|---|---|---|
| mirror.yml | 267/289 (92.4%) | 22 | 16 slim 2-3 forge variants |
| secret-scanner | 273/281 (97.2%) MISSING_SHELL_SECRETS | 3 | Only standards repo carries shell-secrets today |
| codeql | 246/263 (93.5%) | 17 | 11 custom 99-114-line workflows |
| hypatia-scan | 249/255 (97.6%) | 6 | Pure propagation lag, no real customisation |
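
Per-class counts and shares like the ones above can be tallied from a classifier's TSV output. The sketch below assumes the class sits in the third column, as in the 6-column format listed earlier; `summarize_classes` is a hypothetical helper name, not one of the committed scripts.

```shell
# summarize_classes: read classifier TSV on stdin and print one line per
# class as "<class>\t<count>\t<percent>", sorted by class name.
# Assumes the class is column 3 (hypothetical helper, for illustration).
summarize_classes() {
  awk -F'\t' '
    NF >= 3 { count[$3]++; total++ }
    END {
      for (c in count)
        printf "%s\t%d\t%.1f%%\n", c, count[c], 100 * count[c] / total
    }
  ' | sort
}
```

For example, `summarize_classes < mirror.tsv` (file name assumed) would print one row per class found in that file.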

Nested-path caveat (documented in README)

gh api /search/code with path:.github/workflows matches the path
PREFIX — monorepo nested workflow files (e.g.,
a2ml/bindings/deno/.github/workflows/hypatia-scan.yml) are EXCLUDED.
Verified for hypatia-scan: broader query without path: returns 704
results vs 255 path-filtered. The same effect likely applies to the
other three templates; sweep tooling must walk all
**/.github/workflows/<template>.yml paths.
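
The two query shapes being compared can be sketched as follows. `build_query` is a hypothetical helper and `my-org` is a placeholder owner; only the `filename:`, `org:`, and `path:` qualifiers come from the description above.

```shell
# build_query FILE ORG [PATH_PREFIX]: compose a code-search query string.
# Omitting PATH_PREFIX gives the broader query that also sees nested copies.
build_query() {
  local file="$1" org="$2" path_prefix="${3:-}"
  local q="filename:${file} org:${org}"
  if [ -n "$path_prefix" ]; then
    q="${q} path:${path_prefix}"
  fi
  printf '%s\n' "$q"
}

# Example invocations (need network and auth, so they are left commented out):
#   gh api -X GET search/code --jq '.total_count' \
#     -f q="$(build_query hypatia-scan.yml my-org .github/workflows)"
#   gh api -X GET search/code --jq '.total_count' \
#     -f q="$(build_query hypatia-scan.yml my-org)"
```

Comparing the two `total_count` values is one way to reproduce the kind of gap reported for hypatia-scan.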

Pattern

Same shape as scripts/apply-baseline.sh (paired with
scripts/tests/apply-baseline-test.sh) — committed durable tooling
rather than ephemeral /tmp scripts.

🤖 Generated with Claude Code

…-workflow campaign

Durable tooling for the wrapper-sweep work that follows each of the
foundational reusable PRs (#187 mirror, #190 secret-scanner, #192
codeql, #193 hypatia-scan).

Each classifier:
- reads a paginated `gh api /search/code` JSON dump
- fetches each unique blob SHA exactly once (cached in $BLOBS_DIR)
- emits per-repo TSV: <repo>\t<sha>\t<class>\t<reason>\t<lines>\t<details>

Classes vary per template but follow the same shape: TRIVIAL (canonical
match, mechanical wrapper) vs SLIM/MISSING/OLDER (propagation lag,
auto-upgrades on first run after wrapper merge) vs NEEDS_REVIEW
(custom workflow body, requires per-repo diff).

Numbers produced by these classifiers across the four campaign templates:
- mirror.yml      — 267/289 TRIVIAL (92.4%); 22 NEEDS_REVIEW
- secret-scanner  — 273/281 missing shell-secrets (97.2%); 1 TRIVIAL (standards itself)
- codeql          — 246/263 mechanical (93.5%); 17 NEEDS_REVIEW
- hypatia-scan    — 249/255 safe-to-standardize-up (97.6%); 6 NEEDS_REVIEW

README documents the path-filter caveat: `gh api /search/code` with
`path:.github/workflows` excludes monorepo-nested workflow files; the
broader `filename:` query (no path filter) catches them. For
hypatia-scan, the broader query returns 704 vs the 255 path-filtered
count — the ~449 nested copies also need wrappers when sweeps fire.

Same as #192 (codeql-reusable): auto-merge is enabled, but zero workflow
runs fired against the head commit. Pushing an empty commit to re-trigger CI.
hyperpolymath added a commit that referenced this pull request May 26, 2026
…ergence set (#205)

## Summary

5th and final reusable in the workflow convergence campaign (see #199
for the meta-doc). Consolidates the per-repo `scorecard.yml` workflow.

## Drift signal (full pagination + per-repo verified)

- **258** top-level estate deployments
- **626** nested copies in monorepos (asdf-tool-plugins,
developer-ecosystem, ssg-collection, standards, ambientops,
julia-ecosystem, etc. — Layer-2 truncation discovery via #204's helper)
- **46** unique blob SHAs / 17.8% structural drift
- Top SHA covers **100/258 (38.8%)** — highest dominant-cluster of the 5
campaigns
- Top 7 SHAs cover ~80%
- **100% mechanical drift, ZERO feature variance** — SPDX header
(PMPL-1.0 / PMPL-1.0-or-later / MPL-2.0), `upload-sarif` SHA-pin churn,
`permissions: read-all` vs `contents: read` wording
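
Figures of this kind can be derived from a plain list of blob SHAs, one per deployment. The sketch below is an illustration only: `drift_stats` is a hypothetical name, and reading "structural drift" as unique SHAs divided by total deployments (46/258 ≈ 17.8%) is an inference from the numbers above, not something the survey states.

```shell
# drift_stats: read one blob SHA per line on stdin and print the number of
# unique SHAs, the total, the assumed drift ratio (unique/total), and the
# share covered by the most common SHA.
drift_stats() {
  sort | uniq -c | sort -rn | awk '
    { unique++; total += $1; if (NR == 1) top = $1 }
    END {
      if (total == 0) { print "no input"; exit }
      printf "unique=%d total=%d drift=%.1f%% top=%d top_share=%.1f%%\n",
             unique, total, 100 * unique / total, top, 100 * top / total
    }'
}
```

Feeding it the SHA column of a survey file, for instance `cut -f3 survey.tsv | drift_stats` (file name and column position assumed), prints a single summary line.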

## Design

- One input: `runs-on` (default ubuntu-latest)
- No `secrets: inherit` — Scorecard uses `GITHUB_TOKEN` directly
- Caller MUST grant `security-events: write` + `id-token: write` on the
calling job (called-workflow permissions are capped by caller)
- Caller keeps own `on:` triggers + `concurrency:` group

## Per Layer-3 caveat from the campaign meta-doc

Nested workflows are inert — GitHub Actions only runs
`.github/workflows/` at the repo root. Sweeping the 626 nested copies is
single-source-of-truth cleanup, not security hardening.

## Campaign convergence set (closes with this PR)

| PR | Template |
|---|---|
| #187 | mirror-reusable.yml |
| #190 | secret-scanner-reusable.yml |
| #192 | codeql-reusable.yml |
| #193 | hypatia-scan-reusable.yml |
| #194 | sweep-classifier scripts |
| #199 | campaign meta-doc |
| #204 | list-workflow-paths.sh (bypass /search/code undercount) |
| **this** | **scorecard-reusable.yml** |

## Test plan

- [ ] Wrapper sweep (~258 top-level + ~626 nested) — owner-gated; not
part of this PR
- [ ] Update classify-* scripts to consume helper TSV — follow-up

🤖 Generated with [Claude Code](https://claude.com/claude-code)
hyperpolymath added a commit that referenced this pull request May 26, 2026
…consumers (#204)

## Summary

Two-commit change adding nested-path support to the sweep-classifier
pipeline:

1. **`scripts/sweep-classifiers/list-workflow-paths.sh`** — walks `gh
repo list` and queries each repo's Git Tree API directly. Bypasses two
compounding undercounts in `gh api /search/code`.
2. **All 4 `classify-*.sh` scripts updated** to consume the helper's TSV
output and emit the sweep-target path as an explicit column.

## Why the helper exists — 3 layers of undercount

1. **Layer 1 — path-prefix filter:** `path:.github/workflows` matches
the path PREFIX, excluding nested
`<subdir>/.github/workflows/<file>.yml` paths outright.
2. **Layer 2 — org-scope truncation:** even broad `filename:<file>.yml
org:<org>` queries hit internal caps. Validated against `scorecard.yml`:
broad query saw 152 paths (all flagged top-level); per-repo enumeration
found **626 additional nested copies** the broad query missed entirely.
3. **Layer 3 — nested workflows are inert:** GitHub Actions only runs
`.github/workflows/` at the repo root. Nested copies are vendored
templates / stale leftover. Security campaigns gain nothing from
sweeping nested copies; single-source-of-truth campaigns still benefit.

## Helper output

TSV, one row per matching workflow file:

```
<repo>\t<path>\t<blob-sha>\t<top-level|nested>
```

Cost: one Git Tree API call per repo (~300 calls), uses `core` bucket
(5000/hr) not throttled `code_search` (10/min).
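
A minimal sketch of how such rows could be produced from one repo's tree listing. This is not the committed `list-workflow-paths.sh`: the function names are hypothetical, and the commented driver assumes the Git Trees endpoint accepts `HEAD` as the tree reference (substitute the default branch name if it does not) and that the tree is small enough not to be truncated.

```shell
# classify_path PATH: "top-level" if the file lives under the repo-root
# .github/workflows/ directory, "nested" otherwise.
classify_path() {
  case "$1" in
    .github/workflows/*) printf 'top-level\n' ;;
    *)                   printf 'nested\n' ;;
  esac
}

# emit_rows REPO FILE: read "<path>\t<sha>" lines on stdin, keep the ones
# matching **/.github/workflows/FILE, and print the 4-column TSV.
emit_rows() {
  local repo="$1" file="$2" path sha tab
  tab="$(printf '\t')"
  while IFS="$tab" read -r path sha; do
    case "$path" in
      ".github/workflows/${file}" | *"/.github/workflows/${file}")
        printf '%s\t%s\t%s\t%s\n' "$repo" "$path" "$sha" "$(classify_path "$path")"
        ;;
    esac
  done
}

# Hypothetical driver (needs network and auth, so it is commented out):
#   gh api "repos/${repo}/git/trees/HEAD?recursive=1" \
#     --jq '.tree[] | select(.type == "blob") | [.path, .sha] | @tsv' \
#     | emit_rows "$repo" scorecard.yml
```

Keeping the filtering in a pure function means it can be exercised without touching the API.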

## Classifier extensions

Each `classify-*.sh` now auto-detects input format from the first byte:
- `{` → JSONL from `gh /search/code` (legacy path)
- otherwise → TSV from `list-workflow-paths.sh` (preferred — handles
nested)
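
The first-byte check can be sketched like this. `detect_format` is a hypothetical name used for illustration; the shared `normalize_input` in `_lib.sh` may be structured differently.

```shell
# detect_format FILE: print "jsonl" if the file's first byte is "{",
# otherwise "tsv". An empty file is treated as TSV.
detect_format() {
  local first
  first="$(head -c 1 "$1")"
  if [ "$first" = "{" ]; then
    printf 'jsonl\n'
  else
    printf 'tsv\n'
  fi
}
```

A classifier could then branch on the result, for example `case "$(detect_format "$input")" in jsonl) ... ;; tsv) ... ;; esac`.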

Output is unified to 7 columns: `repo \t path \t sha \t class \t reason
\t lines \t details`. The new `path` column carries the file's location
inside the repo, so sweeps can target nested copies as first-class
wrapper sites.

Shared `normalize_input` extracted into `_lib.sh`; each classifier
sources it.

## Validation

Smoke-tested both input paths:
- TSV (helper): classify-mirror.sh on scorecard-tuples.tsv (287 repos ×
top-level + nested) — fetches blobs and emits per-(repo, path) rows.
- JSONL (legacy): classify-mirror.sh on mirror-full.json — 267 TRIVIAL +
22 NEEDS_REVIEW, matching prior `/tmp/drift-survey/sweep-report.md`.

## Stacked on #194

`scripts/sweep-classifiers/` only exists once #194 merges. The diff
against `main` includes #194's files transitively; once #194 lands, this
PR narrows to just the helper + extensions.

## Standing follow-ups

- Once this lands, re-survey each candidate with the helper for
ground-truth wrapper-site counts before firing any sweep.

🤖 Generated with [Claude Code](https://claude.com/claude-code)